fix: make the dedup operator cover all column types #80
Merged
liulx20 merged 5 commits into alibaba:main from Mar 19, 2026
shirly121
approved these changes
Mar 19, 2026
liulx20 added a commit to liulx20/neug that referenced this pull request Mar 20, 2026
* make dedup operator cover all column types
* format
* fix
longbinlai added a commit that referenced this pull request Mar 20, 2026
* add java sdk * add test cases * Update tools/java_driver/USAGE.md Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * Update tools/java_driver/USAGE.md Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * fix some issues * add ClientTest * update doc * fix doc * Update tools/java_driver/pom.xml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update tools/java_driver/src/test/java/org/alibaba/neug/driver/InternalResultSetTest.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * format * rename org to com * fix doc * add result metadata * fix * Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * add tests * add doc * add maven * Update tools/java_driver/src/main/java/com/alibaba/neug/driver/utils/Client.java Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * Update tools/java_driver/src/main/java/com/alibaba/neug/driver/internal/InternalResultSet.java Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * add e2e ci * add param test * format * Update tools/java_driver/src/main/java/com/alibaba/neug/driver/internal/InternalSession.java Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * Update InternalSession.java * remove pb generated * fix doc * fix doc * fix doc * fix workflows * fix version * fix generator * fix maven action * fix: catch OSError in neug-cli readline history loading on macOS (#75) * fix: catch OSError in neug-cli readline history loading on macOS On macOS, Python's readline module is backed by libedit instead of GNU readline. When ~/.neug_history was written by a GNU readline session (e.g. from Docker/Linux), libedit raises OSError (errno 22 EINVAL) instead of silently handling the incompatible format. 
The original code only caught FileNotFoundError, causing neug-cli to crash on startup. Broaden the exception handler to also catch OSError so the history file is simply skipped, matching the intended behavior. Fixes #74 * fix: scope OSError catch to errno.EINVAL for libedit incompatibility Per greptile review: catching the full OSError base class could silently swallow unrelated errors such as PermissionError or IsADirectoryError. Narrow the catch to only suppress errno.EINVAL (22), which is the specific error raised by macOS libedit when it encounters a GNU readline history file. All other OSError variants are re-raised so users see genuine problems. Also add 'import errno' to top-level imports. * Update tools/java_driver/src/main/java/com/alibaba/neug/driver/internal/InternalDriver.java Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * fix getBigDecimal * Update tools/java_driver/src/main/java/com/alibaba/neug/driver/internal/InternalResultSet.java Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * Update tools/java_driver/src/main/java/com/alibaba/neug/driver/internal/InternalResultSet.java Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * fix getObject * feat: Support Export Query Results to JSON/JSONL file (#60) * support export arrow table to csv format Committed-by: Xiaoli Zhou from Dev container * export query response PB to csv format Committed-by: Xiaoli Zhou from Dev container * minor fix according to review Committed-by: Xiaoli Zhou from Dev container * fix according to review Committed-by: Xiaoli Zhou from Dev container * minor fix Committed-by: Xiaoli Zhou from Dev container * support export query results to json format Committed-by: Xiaoli Zhou from Dev container * minor fix Committed-by: Xiaoli Zhou from Dev container * remove 'newline_delimited' settings and detect jsonl format from path Committed-by: Xiaoli Zhou from Dev 
container Committed-by: Xiaoli Zhou from Dev container Committed-by: Xiaoli Zhou from Dev container Committed-by: Xiaoli Zhou from Dev container * minor fix Committed-by: Xiaoli Zhou from Dev container * add export to json tests in CI Committed-by: Xiaoli Zhou from Dev container Committed-by: Xiaoli Zhou from Dev container * Update extension/json/src/json_export_function.cc Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * Update extension/json/src/json_export_function.cc Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * Update extension/json/src/json_export_function.cc Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * minor fix Committed-by: Xiaoli Zhou from Dev container * minor fix Committed-by: Xiaoli Zhou from Dev container * refine extension tests anotation Committed-by: Xiaoli Zhou from Dev container * minor fix Committed-by: Xiaoli Zhou from Dev container * rename INSTALL_EXTENSIONS to CI_INSTALL_EXTENSIONS to avoid conflict Committed-by: Xiaoli Zhou from Dev container * refine json extension tests ci Committed-by: Xiaoli Zhou from Dev container * minor fix Committed-by: Xiaoli Zhou from Dev container Committed-by: Xiaoli Zhou from Dev container --------- Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * remove bytearray * add codegraph-qa skill (#78) * fix: Fix default value support for all type of properties (#63) Refactor the default value support for storage, avoid exposing default_value on column and mmap_array --------- Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * fix: Fix incorrect edge table state when transforming between bundled and unbundled (#28) Fix incorrect edge table state when transforming between bundled and unbundled, include special case for string properties * fix: make the dedup operator cover all column types (#80) * 
make dedup operator cover all column types * format * fix * Correct the is_optional interface behavior for certain columns (#90) * add a codegraph example (#87) Co-authored-by: Longbin Lai <longbin.lai@gmail.com> * add checkRowIndex * add update_was_null * update doc * fix * update doc * fix * Implement the iteration method for QueryResult * update query_result.md * update * update doc * format example * format --------- Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Longbin Lai <longbin.lai@gmail.com> Co-authored-by: Xiaoli Zhou <yihe.zxl@alibaba-inc.com> Co-authored-by: BingqingLyu <bingqing.lbq@alibaba-inc.com> Co-authored-by: Zhang Lei <xiaolei.zl@alibaba-inc.com>
Fixes #82
Greptile Summary
This PR extends the `DEDUP` operator to cover all column types by changing `generate_dedup_offset` from a `void` method (that terminated with `LOG(FATAL)` for unsupported types) to a `bool` method that returns `false` to signal "use the generic hash-based fallback." It also adds the previously-missing `MSVertexColumn::generate_dedup_offset` implementation and removes the row_num parameter from `ColumnsUtils::generate_dedup_offset`.

Key changes:
- `generate_dedup_offset` now returns `bool`; `false` means "fall back to hash-based dedup" in `dedup.cc`
- `MSVertexColumn` gains a correct `generate_dedup_offset` with a `null_seen` guard
- `ArrowArrayContextColumn`'s entire type-dispatched dedup implementation is removed; it now silently falls back through the base-class default, which emits `LOG(ERROR)` on every dedup; this will pollute production logs with false error messages for a valid code path
- `LOG(FATAL)` → `LOG(ERROR)` across `MSEdgeColumn`, `ListColumn`, `StructColumn`, and the base class allows graceful fallback, but `LOG(ERROR)` is still too severe for an expected, handled code path; `LOG(WARNING)` or lower would be more appropriate
- `ColumnsUtils::generate_dedup_offset` helper has a harmless but redundant `resize` call after the constructor already sets the correct size
- `if` body in `dedup.cc` (line 37–38) is valid C++ but makes the intent hard to read at a glance

Confidence Score: 3/5
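The fast-path/fallback contract summarized above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the NeuG code: `Column`, `IntColumn`, `OpaqueColumn`, `encode`, and `dedup_single` are hypothetical names, `encode` merely stands in for the row encoding done via `get_elem`, and the real `generate_dedup_offset` signature may differ.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical column interface. generate_dedup_offset returns true when a
// type-specific fast path filled `offsets`, and false to request the fallback.
struct Column {
  virtual ~Column() = default;
  virtual std::size_t size() const = 0;
  // Stand-in for a per-row encoding used by the generic hash-based path.
  virtual std::string encode(std::size_t row) const = 0;
  // Base-class default: no fast path available.
  virtual bool generate_dedup_offset(std::vector<std::size_t>& /*offsets*/) const {
    return false;
  }
};

// A column type with a fast path (here: a plain set of ints).
struct IntColumn : Column {
  std::vector<int> data;
  explicit IntColumn(std::vector<int> d) : data(std::move(d)) {}
  std::size_t size() const override { return data.size(); }
  std::string encode(std::size_t row) const override {
    return std::to_string(data[row]);
  }
  bool generate_dedup_offset(std::vector<std::size_t>& offsets) const override {
    std::unordered_set<int> seen;
    for (std::size_t i = 0; i < data.size(); ++i) {
      if (seen.insert(data[i]).second) offsets.push_back(i);  // first occurrence
    }
    return true;
  }
};

// A column type without a fast path; it inherits the `return false` default.
struct OpaqueColumn : Column {
  std::vector<std::string> data;
  explicit OpaqueColumn(std::vector<std::string> d) : data(std::move(d)) {}
  std::size_t size() const override { return data.size(); }
  std::string encode(std::size_t row) const override { return data[row]; }
};

// Single-column dedup: try the fast path, otherwise hash every encoded row.
std::vector<std::size_t> dedup_single(const Column& col) {
  std::vector<std::size_t> offsets;
  if (col.generate_dedup_offset(offsets)) return offsets;
  offsets.clear();  // defensive: drop anything a failed fast path left behind
  std::unordered_set<std::string> seen;
  for (std::size_t i = 0; i < col.size(); ++i) {
    if (seen.insert(col.encode(i)).second) offsets.push_back(i);
  }
  return offsets;
}
```

Both paths yield the index of the first occurrence of each distinct value, so a caller cannot tell which one ran. The `offsets.clear()` before the fallback is a design choice made for this sketch: it keeps the result correct even if some implementation appends to `offsets` and then returns `false`.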
Important Files Changed
Flowchart
    %%{init: {'theme': 'neutral'}}%%
    flowchart TD
        A[Dedup called] --> B{cols empty?}
        B -- yes --> C[Return ctx unchanged]
        B -- no --> D{Single column?}
        D -- yes --> E[Call generate_dedup_offset]
        E --> F{Returns true?}
        F -- yes --> G[Fast path: offsets populated]
        F -- "no - LOG ERROR fired" --> H[Clear offsets, use hash fallback]
        D -- "no, multi-column" --> H
        H --> I[Hash fallback: encode each row via get_elem]
        I --> J[offsets = unique row indices]
        G --> K[Build result Context]
        J --> K
        K --> L[reshuffle offsets]
        L --> M[Return ret]
        subgraph FastPathTypes["Fast-path column types"]
            FP1[SLVertexColumn - bitset]
            FP2[MSVertexColumn - set plus null flag - NEW]
            FP3[MLVertexColumn - set]
            FP4[PathColumn - sort dedup vector]
            FP5[ValueColumn - sort or set]
            FP6[SDSLEdgeColumn, BDSLEdgeColumn, SDMLEdgeColumn, BDMLEdgeColumn]
        end
        subgraph FallbackTypes["No fast path - hash fallback"]
            FB1[ArrowArrayContextColumn - base class LOG ERROR]
            FB2[MSEdgeColumn - LOG ERROR]
            FB3[ListColumn - LOG ERROR]
            FB4[StructColumn - LOG ERROR]
        end

Comments Outside Diff (1)
src/execution/common/operators/retrieve/dedup.cc, line 37-54 (link)

`generate_dedup_offset` fails mid-way

The new control flow silently falls through to the slow-path `else` branch whenever `generate_dedup_offset` returns `false`. If a future `generate_dedup_offset` implementation partially populates `offsets` before returning `false` (e.g., due to an error mid-loop), the `else` branch would append additional entries onto the already-partially-populated vector, producing a corrupted, non-deduplicated result.

All current `false`-returning implementations happen to leave `offsets` untouched, so this is not a bug today. But the contract is nowhere documented: there is no `offsets.clear()` guard at the top of the `else` branch, and no documented requirement that callers of `generate_dedup_offset` leave `offsets` unchanged on failure.

Consider adding a defensive `offsets.clear()` at the start of the `else` branch, or documenting the contract that `generate_dedup_offset` must not modify `offsets` on failure:

src/execution/common/operators/retrieve/dedup.cc, line 37-54 (link)

`offsets.clear()` in fallback branch

When `cols.size() == 1` and `generate_dedup_offset` returns `false`, the code falls into the `else` branch. Currently all false-returning implementations leave `offsets` untouched (e.g. `MSEdgeColumn`, `ListColumn`, `StructColumn`), so the `else` block sees an empty vector and fills it correctly. However, the `else` block never explicitly clears `offsets` before appending to it.

If any future `generate_dedup_offset` implementation partially populates `offsets` before discovering a failure and returning `false` (which is perfectly reasonable), the stale entries would be mixed with the fresh ones produced by the `else` block, causing duplicate or incorrect result rows.

The fix is to add `offsets.clear()` at the start of the `else` block:

include/neug/execution/common/columns/columns_utils.h, line 43 (link)

`row_num == 0`

`offsets.push_back(row_indices[0])` is called unconditionally. If `row_num` is zero, `row_indices` is empty and `row_indices[0]` is undefined behaviour. Although this is pre-existing code, this PR adds several new callers that now return `true` and reach this path (`PathColumn`, `BDSLEdgeColumn`, `SDMLEdgeColumn`, `BDMLEdgeColumn`), widening the exposure. A guard is needed:

Last reviewed commit: "fix"
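For the `row_num == 0` comment on `columns_utils.h`, the kind of guard being asked for might look like the sketch below. The review excerpt only shows `offsets.push_back(row_indices[0])`, so everything else here is an assumption: `sorted_dedup_offsets` is a hypothetical name, and the sort-then-scan structure is a guess at how a sort-based helper is typically written, not the actual `ColumnsUtils::generate_dedup_offset`.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical sort-based dedup helper: sorts row indices by value and keeps
// the first index of each run of equal values. Requires only operator< on T.
template <typename T>
void sorted_dedup_offsets(const std::vector<T>& values,
                          std::vector<std::size_t>& offsets) {
  offsets.clear();
  if (values.empty()) return;  // guard: row_indices[0] below needs >= 1 row

  std::vector<std::size_t> row_indices(values.size());
  std::iota(row_indices.begin(), row_indices.end(), std::size_t{0});
  // Stable sort keeps the smallest row index first within equal values.
  std::stable_sort(row_indices.begin(), row_indices.end(),
                   [&](std::size_t a, std::size_t b) {
                     return values[a] < values[b];
                   });

  offsets.push_back(row_indices[0]);
  for (std::size_t i = 1; i < row_indices.size(); ++i) {
    // In sorted order, a strictly greater value marks the start of a new run.
    if (values[row_indices[i - 1]] < values[row_indices[i]]) {
      offsets.push_back(row_indices[i]);
    }
  }
}
```

With the early return, an empty input produces an empty `offsets` vector instead of indexing into an empty `row_indices`. Note that the resulting offsets are ordered by value rather than by row position.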